Import at least pandas, numpy and matplotlib.pyplot
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Check where your data is and what the files are called (use !dir)
!dir
Volume in drive D is SSD Crucial
Volume Serial Number is E602-7CD2
Directory of D:\Code\DSPE\Task LOF
04/16/2021 10:13 PM <DIR> .
04/16/2021 10:13 PM <DIR> ..
04/07/2021 05:02 PM <DIR> .ipynb_checkpoints
04/16/2021 10:13 PM 2,699,210 instructions_without_code_final.ipynb
03/24/2021 05:35 PM 14,075 machine_2021_program_summary_.csv
03/24/2021 05:35 PM 580 tutorial_data.csv
04/07/2021 05:02 PM 736,525 tutorial_during_the_lecture_2021_04_07.ipynb
4 File(s) 3,450,390 bytes
3 Dir(s) 334,501,134,336 bytes free
Volume in drive C is Windows
Volume Serial Number is 5441-C8EF
Directory of C:\Users\bialekj\Documents\Projekty\Wyklady\anomaly_detection_workshop\data
19.03.2020 09:06 <DIR> .
19.03.2020 09:06 <DIR> ..
19.03.2020 13:28 14 075 machine_2021_program_summary_.csv
19.03.2020 09:06 <DIR> old
1 File(s) 14 075 bytes
3 Dir(s) 4 900 331 520 bytes free
Read the machine_2021_program_summary_.csv file into a pandas DataFrame called df
df = pd.read_csv('machine_2021_program_summary_.csv')
Check the first 10 rows of df
df.head(10)
| | start | end | VibrationPeak_mean [dB] | VibrationPeak_median [dB] | VibrationCarpet_mean [dB] | VibrationCarpet_median [dB] | Temperature_median [°C] | Temperature_mean [°C] | program |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-09-18 18:58:53 | 2019-09-18 19:18:53 | -6.921236 | -8.304498 | -8.264868 | -5.481411 | -21.700001 | -21.700001 | A11 |
| 1 | 2019-09-18 19:18:53 | 2019-09-18 19:38:53 | -6.470144 | -7.244804 | -2.106002 | -10.807974 | 21.700001 | 21.700001 | A13 |
| 2 | 2019-09-19 09:54:53 | 2019-09-19 10:14:53 | 34.503162 | 35.762915 | 27.908133 | 22.816784 | -22.299999 | -22.299999 | A13 |
| 3 | 2019-09-19 10:14:53 | 2019-09-19 10:34:53 | 22.144701 | 36.509675 | 28.524839 | 28.941813 | 22.200001 | 22.200001 | A13 |
| 4 | 2019-09-19 10:34:53 | 2019-09-19 10:54:53 | -5.823500 | -4.848217 | -1.193132 | 2.436969 | -22.100000 | -22.100000 | A11 |
| 5 | 2019-09-19 20:03:51 | 2019-09-19 20:23:51 | -2.962862 | -7.298702 | -4.637185 | -3.323306 | 21.500000 | 21.500000 | A11 |
| 6 | 2019-09-21 04:03:55 | 2019-09-21 04:23:55 | 30.678012 | 35.091929 | 21.354449 | 28.506288 | -21.900000 | -21.900000 | A13 |
| 7 | 2019-09-26 17:20:48 | 2019-09-26 17:40:48 | 32.561561 | 32.761023 | 19.731399 | 18.239473 | 22.100000 | 22.100000 | A13 |
| 8 | 2019-09-26 17:40:48 | 2019-09-26 18:00:48 | 20.310017 | 38.026277 | 13.001814 | 22.507591 | -22.200001 | -22.200001 | A13 |
| 9 | 2019-09-27 11:16:58 | 2019-09-27 11:36:58 | -5.022353 | -7.529789 | -10.443645 | -9.293475 | 23.400000 | 23.400000 | A11 |
List all the columns in df
df.columns
Index(['start', 'end', 'VibrationPeak_mean [dB]', 'VibrationPeak_median [dB]',
'VibrationCarpet_mean [dB]', 'VibrationCarpet_median [dB]',
'Temperature_median [°C]', 'Temperature_mean [°C]', 'program'],
dtype='object')
Check the types of all the columns. Use a for loop to print the following statements.
sub1 = df.copy()
for col in df.columns:
    print("Column {} is {}".format(col, df[col].dtype))
Column start is object
Column end is object
Column VibrationPeak_mean [dB] is float64
Column VibrationPeak_median [dB] is float64
Column VibrationCarpet_mean [dB] is float64
Column VibrationCarpet_median [dB] is float64
Column Temperature_median [°C] is float64
Column Temperature_mean [°C] is float64
Column program is object
Now use DataFrame.info() to get the same information.
sub1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   start                        89 non-null     object
 1   end                          89 non-null     object
 2   VibrationPeak_mean [dB]      89 non-null     float64
 3   VibrationPeak_median [dB]    88 non-null     float64
 4   VibrationCarpet_mean [dB]    89 non-null     float64
 5   VibrationCarpet_median [dB]  89 non-null     float64
 6   Temperature_median [°C]      89 non-null     float64
 7   Temperature_mean [°C]        89 non-null     float64
 8   program                      89 non-null     object
dtypes: float64(6), object(3)
memory usage: 6.4+ KB
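The same type information is also available as a Series through DataFrame.dtypes, which saves indexing each column by hand. A minimal sketch on a small made-up frame (the `demo` frame and its values are illustrative, not the workshop data):

```python
import pandas as pd

# Small illustrative frame; values are made up, not read from the CSV.
demo = pd.DataFrame({
    "start": ["2019-09-18 18:58:53", "2019-09-18 19:18:53"],
    "VibrationPeak_mean [dB]": [-6.92, -6.47],
    "program": ["A11", "A13"],
})

# dtypes holds one dtype per column, indexed by column name.
col_types = demo.dtypes
for col, dtype in col_types.items():
    print("Column {} is {}".format(col, dtype))
```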
Check if there are any NaN values in df
df.isnull().values.any()
True
Find the missing values
df[df.isnull().any(axis=1)]
| | start | end | VibrationPeak_mean [dB] | VibrationPeak_median [dB] | VibrationCarpet_mean [dB] | VibrationCarpet_median [dB] | Temperature_median [°C] | Temperature_mean [°C] | program |
|---|---|---|---|---|---|---|---|---|---|
| 12 | 2019-09-27 17:44:01 | 2019-09-27 18:04:01 | -2.897968 | NaN | -4.108349 | -4.903356 | -23.1 | -23.1 | A11 |
Get the share of rows with missing data (as a fraction; multiply by 100 for a percentage)
len(df[df.isnull().any(axis=1)])/len(df)
0.011235955056179775
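To see which column the missing values sit in, and not only how many rows are affected, the null mask can be summed per column. A minimal sketch on made-up data (`demo` and its column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative frame with a single missing value; not the workshop data.
demo = pd.DataFrame({
    "peak_mean": [1.0, 2.0, 3.0, 4.0],
    "peak_median": [1.5, np.nan, 2.5, 3.5],
})

# Number of missing values in each column.
missing_per_column = demo.isnull().sum()

# Rows that contain at least one missing value, and their share in percent.
rows_with_missing = demo.isnull().any(axis=1)
percent_rows_missing = 100 * rows_with_missing.mean()

print(missing_per_column)
print(percent_rows_missing)
```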
Drop the row with the missing value
sub1.dropna(inplace=True)
Reset index of the dataframe (since you removed one row), drop the old index. View first 15 rows. See if row #12 is there.
sub1 = sub1.reset_index(drop=True)
sub1.head(15)
| | start | end | VibrationPeak_mean [dB] | VibrationPeak_median [dB] | VibrationCarpet_mean [dB] | VibrationCarpet_median [dB] | Temperature_median [°C] | Temperature_mean [°C] | program |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-09-18 18:58:53 | 2019-09-18 19:18:53 | -6.921236 | -8.304498 | -8.264868 | -5.481411 | -21.700001 | -21.700001 | A11 |
| 1 | 2019-09-18 19:18:53 | 2019-09-18 19:38:53 | -6.470144 | -7.244804 | -2.106002 | -10.807974 | 21.700001 | 21.700001 | A13 |
| 2 | 2019-09-19 09:54:53 | 2019-09-19 10:14:53 | 34.503162 | 35.762915 | 27.908133 | 22.816784 | -22.299999 | -22.299999 | A13 |
| 3 | 2019-09-19 10:14:53 | 2019-09-19 10:34:53 | 22.144701 | 36.509675 | 28.524839 | 28.941813 | 22.200001 | 22.200001 | A13 |
| 4 | 2019-09-19 10:34:53 | 2019-09-19 10:54:53 | -5.823500 | -4.848217 | -1.193132 | 2.436969 | -22.100000 | -22.100000 | A11 |
| 5 | 2019-09-19 20:03:51 | 2019-09-19 20:23:51 | -2.962862 | -7.298702 | -4.637185 | -3.323306 | 21.500000 | 21.500000 | A11 |
| 6 | 2019-09-21 04:03:55 | 2019-09-21 04:23:55 | 30.678012 | 35.091929 | 21.354449 | 28.506288 | -21.900000 | -21.900000 | A13 |
| 7 | 2019-09-26 17:20:48 | 2019-09-26 17:40:48 | 32.561561 | 32.761023 | 19.731399 | 18.239473 | 22.100000 | 22.100000 | A13 |
| 8 | 2019-09-26 17:40:48 | 2019-09-26 18:00:48 | 20.310017 | 38.026277 | 13.001814 | 22.507591 | -22.200001 | -22.200001 | A13 |
| 9 | 2019-09-27 11:16:58 | 2019-09-27 11:36:58 | -5.022353 | -7.529789 | -10.443645 | -9.293475 | 23.400000 | 23.400000 | A11 |
| 10 | 2019-09-27 11:36:58 | 2019-09-27 11:56:58 | -8.220528 | -5.825943 | -7.020347 | -9.362177 | -23.400000 | -23.400000 | A11 |
| 11 | 2019-09-27 17:24:01 | 2019-09-27 17:44:01 | -6.506379 | -10.573105 | -3.227738 | -6.040868 | 23.000000 | 23.000000 | A11 |
| 12 | 2019-10-01 18:44:17 | 2019-10-01 19:04:17 | 9.137873 | 15.062634 | 1.620201 | 3.346457 | 23.700001 | 23.700001 | A12 |
| 13 | 2019-10-03 18:19:38 | 2019-10-03 18:24:43 | 17.853521 | 18.031566 | 16.722668 | 11.515177 | -22.400000 | -22.400000 | A12 |
| 14 | 2019-10-03 18:24:46 | 2019-10-03 18:44:46 | 20.420528 | 18.664966 | 11.114553 | 15.407386 | 22.400000 | 22.400000 | A12 |
Change the type of 'start' and 'end' to datetime. Show df.info() after the change.
sub1['start'] = pd.to_datetime(sub1['start'])
sub1['end'] = pd.to_datetime(sub1['end'])
sub1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88 entries, 0 to 87
Data columns (total 9 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   start                        88 non-null     datetime64[ns]
 1   end                          88 non-null     datetime64[ns]
 2   VibrationPeak_mean [dB]      88 non-null     float64
 3   VibrationPeak_median [dB]    88 non-null     float64
 4   VibrationCarpet_mean [dB]    88 non-null     float64
 5   VibrationCarpet_median [dB]  88 non-null     float64
 6   Temperature_median [°C]      88 non-null     float64
 7   Temperature_mean [°C]        88 non-null     float64
 8   program                      88 non-null     object
dtypes: datetime64[ns](2), float64(6), object(1)
memory usage: 6.3+ KB
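Once both columns are real datetimes, you can do arithmetic on them. The sketch below converts two timestamp pairs copied from the table above and derives the run length in minutes; the explicit `format` string and the `duration_min` column are assumptions added for illustration, not part of the task.

```python
import pandas as pd

# Two start/end pairs taken from the rows shown earlier (rows 0 and 13).
demo = pd.DataFrame({
    "start": ["2019-09-18 18:58:53", "2019-10-03 18:19:38"],
    "end": ["2019-09-18 19:18:53", "2019-10-03 18:24:43"],
})

# Passing the format explicitly avoids any guessing about day/month order.
for col in ["start", "end"]:
    demo[col] = pd.to_datetime(demo[col], format="%Y-%m-%d %H:%M:%S")

# Subtracting datetimes gives timedeltas; convert them to minutes.
demo["duration_min"] = (demo["end"] - demo["start"]).dt.total_seconds() / 60
print(demo)
```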
Plot values for all the parameters (columns) as a function of the index.
sub1.plot(subplots=True, figsize=(10,9), style='o')
array([<AxesSubplot:>, <AxesSubplot:>, <AxesSubplot:>, <AxesSubplot:>,
<AxesSubplot:>, <AxesSubplot:>, <AxesSubplot:>, <AxesSubplot:>],
dtype=object)
Plot values for all the parameters (columns) as a function of the 'start' column
sub1.plot(x='start', subplots=True, figsize=(10,9), style='o')
array([<AxesSubplot:xlabel='start'>, <AxesSubplot:xlabel='start'>,
<AxesSubplot:xlabel='start'>, <AxesSubplot:xlabel='start'>,
<AxesSubplot:xlabel='start'>, <AxesSubplot:xlabel='start'>,
<AxesSubplot:xlabel='start'>], dtype=object)
Plot distribution for vibration and temperature parameters
sub2 = sub1[['Temperature_mean [°C]','Temperature_median [°C]','VibrationCarpet_mean [dB]','VibrationCarpet_median [dB]',
'VibrationPeak_mean [dB]','VibrationPeak_median [dB]']]
sub2.hist(bins=20, figsize=(10,10));
Statistics for temperature look weird, right? We reached out to the client and found out that there was something wrong with the measurement. Let's drop it from now on and focus on the vibration parameters.
sub1.head(5)
| | start | end | VibrationPeak_mean [dB] | VibrationPeak_median [dB] | VibrationCarpet_mean [dB] | VibrationCarpet_median [dB] | Temperature_median [°C] | Temperature_mean [°C] | program |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-09-18 18:58:53 | 2019-09-18 19:18:53 | -6.921236 | -8.304498 | -8.264868 | -5.481411 | -21.700001 | -21.700001 | A11 |
| 1 | 2019-09-18 19:18:53 | 2019-09-18 19:38:53 | -6.470144 | -7.244804 | -2.106002 | -10.807974 | 21.700001 | 21.700001 | A13 |
| 2 | 2019-09-19 09:54:53 | 2019-09-19 10:14:53 | 34.503162 | 35.762915 | 27.908133 | 22.816784 | -22.299999 | -22.299999 | A13 |
| 3 | 2019-09-19 10:14:53 | 2019-09-19 10:34:53 | 22.144701 | 36.509675 | 28.524839 | 28.941813 | 22.200001 | 22.200001 | A13 |
| 4 | 2019-09-19 10:34:53 | 2019-09-19 10:54:53 | -5.823500 | -4.848217 | -1.193132 | 2.436969 | -22.100000 | -22.100000 | A11 |
sub1.columns
Index(['start', 'end', 'VibrationPeak_mean [dB]', 'VibrationPeak_median [dB]',
'VibrationCarpet_mean [dB]', 'VibrationCarpet_median [dB]',
'Temperature_median [°C]', 'Temperature_mean [°C]', 'program'],
dtype='object')
Now find all possible combinations of vibration parameters (without repetitions). Print them out. (hint: use two for loops)
sub3 = sub1[['VibrationPeak_mean [dB]','VibrationPeak_median [dB]',
'VibrationCarpet_mean [dB]','VibrationCarpet_median [dB]']]
for i,col1 in enumerate(sub3.columns):
    for j,col2 in enumerate(sub3.columns):
        print(i,col1,j,col2)
0 VibrationPeak_mean [dB] 0 VibrationPeak_mean [dB]
0 VibrationPeak_mean [dB] 1 VibrationPeak_median [dB]
0 VibrationPeak_mean [dB] 2 VibrationCarpet_mean [dB]
0 VibrationPeak_mean [dB] 3 VibrationCarpet_median [dB]
1 VibrationPeak_median [dB] 0 VibrationPeak_mean [dB]
1 VibrationPeak_median [dB] 1 VibrationPeak_median [dB]
1 VibrationPeak_median [dB] 2 VibrationCarpet_mean [dB]
1 VibrationPeak_median [dB] 3 VibrationCarpet_median [dB]
2 VibrationCarpet_mean [dB] 0 VibrationPeak_mean [dB]
2 VibrationCarpet_mean [dB] 1 VibrationPeak_median [dB]
2 VibrationCarpet_mean [dB] 2 VibrationCarpet_mean [dB]
2 VibrationCarpet_mean [dB] 3 VibrationCarpet_median [dB]
3 VibrationCarpet_median [dB] 0 VibrationPeak_mean [dB]
3 VibrationCarpet_median [dB] 1 VibrationPeak_median [dB]
3 VibrationCarpet_median [dB] 2 VibrationCarpet_mean [dB]
3 VibrationCarpet_median [dB] 3 VibrationCarpet_median [dB]
for i,col1 in enumerate(sub3.columns):
    for j,col2 in enumerate(sub3.columns):
        if i != j and i<j:
            print(col1,col2)
VibrationPeak_mean [dB] VibrationPeak_median [dB]
VibrationPeak_mean [dB] VibrationCarpet_mean [dB]
VibrationPeak_mean [dB] VibrationCarpet_median [dB]
VibrationPeak_median [dB] VibrationCarpet_mean [dB]
VibrationPeak_median [dB] VibrationCarpet_median [dB]
VibrationCarpet_mean [dB] VibrationCarpet_median [dB]
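The two nested loops with the `i<j` filter produce the same pairs that the standard library gives directly. A short sketch using itertools.combinations as an alternative to the hinted approach (the list `vibration_cols` simply repeats the four column names from above):

```python
from itertools import combinations

vibration_cols = [
    "VibrationPeak_mean [dB]",
    "VibrationPeak_median [dB]",
    "VibrationCarpet_mean [dB]",
    "VibrationCarpet_median [dB]",
]

# All unordered pairs of distinct columns, in the same order as the loops above.
pairs = list(combinations(vibration_cols, 2))
for col1, col2 in pairs:
    print(col1, col2)
```

With four columns this yields 4*3/2 = 6 pairs, matching the six lines printed above.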
Plot scatter plots for all of the combinations (use code from previous task)
for i,col1 in enumerate(sub3.columns):
    for j,col2 in enumerate(sub3.columns):
        if i != j and i<j:
            plt.scatter(sub3[col1],sub3[col2])
            plt.xlabel(col1)
            plt.ylabel(col2)  # the y axis shows col2, not col1
            plt.title('{} vs {}'.format(col1,col2))
            plt.show()
Seems that we have three clusters. Not sure of that? Let's make some fancy plots with seaborn (jointplot).
import seaborn as sns
for i,col1 in enumerate(sub3.columns):
    for j,col2 in enumerate(sub3.columns):
        if i != j and i<j:
            sns.jointplot(x=col1, y=col2, data=sub3, marginal_kws={'bins':15});
standard joint plot, bins=15
Hexagonal jointplot, bins=15, hex gridsize=12
for i,col1 in enumerate(sub3.columns):
    for j,col2 in enumerate(sub3.columns):
        if i != j and i<j:
            sns.jointplot(x=col1, y=col2, data=sub3, kind='hex', joint_kws={'gridsize':12}, marginal_kws={'bins':15});
Didn't we forget about the program? Let's see how many programs we have (value counts)
sub1.value_counts('program')
program
A11    32
A12    29
A13    27
dtype: int64
So there are three programs as well. Try to make scatter plots colored by program name and see if they correspond to the clusters we have found.
sub4 = sub1[['VibrationPeak_mean [dB]','VibrationPeak_median [dB]',
'VibrationCarpet_mean [dB]','VibrationCarpet_median [dB]','program']]
for i,col1 in enumerate(sub4.columns):
    for j,col2 in enumerate(sub4.columns):
        for k,col3 in enumerate(sub4.columns):
            if i!=j and i<j and j!=4:
                print(i,col1,j,col2,k,col3)
0 VibrationPeak_mean [dB] 1 VibrationPeak_median [dB] 0 VibrationPeak_mean [dB]
0 VibrationPeak_mean [dB] 1 VibrationPeak_median [dB] 1 VibrationPeak_median [dB]
0 VibrationPeak_mean [dB] 1 VibrationPeak_median [dB] 2 VibrationCarpet_mean [dB]
0 VibrationPeak_mean [dB] 1 VibrationPeak_median [dB] 3 VibrationCarpet_median [dB]
0 VibrationPeak_mean [dB] 1 VibrationPeak_median [dB] 4 program
0 VibrationPeak_mean [dB] 2 VibrationCarpet_mean [dB] 0 VibrationPeak_mean [dB]
0 VibrationPeak_mean [dB] 2 VibrationCarpet_mean [dB] 1 VibrationPeak_median [dB]
0 VibrationPeak_mean [dB] 2 VibrationCarpet_mean [dB] 2 VibrationCarpet_mean [dB]
0 VibrationPeak_mean [dB] 2 VibrationCarpet_mean [dB] 3 VibrationCarpet_median [dB]
0 VibrationPeak_mean [dB] 2 VibrationCarpet_mean [dB] 4 program
0 VibrationPeak_mean [dB] 3 VibrationCarpet_median [dB] 0 VibrationPeak_mean [dB]
0 VibrationPeak_mean [dB] 3 VibrationCarpet_median [dB] 1 VibrationPeak_median [dB]
0 VibrationPeak_mean [dB] 3 VibrationCarpet_median [dB] 2 VibrationCarpet_mean [dB]
0 VibrationPeak_mean [dB] 3 VibrationCarpet_median [dB] 3 VibrationCarpet_median [dB]
0 VibrationPeak_mean [dB] 3 VibrationCarpet_median [dB] 4 program
1 VibrationPeak_median [dB] 2 VibrationCarpet_mean [dB] 0 VibrationPeak_mean [dB]
1 VibrationPeak_median [dB] 2 VibrationCarpet_mean [dB] 1 VibrationPeak_median [dB]
1 VibrationPeak_median [dB] 2 VibrationCarpet_mean [dB] 2 VibrationCarpet_mean [dB]
1 VibrationPeak_median [dB] 2 VibrationCarpet_mean [dB] 3 VibrationCarpet_median [dB]
1 VibrationPeak_median [dB] 2 VibrationCarpet_mean [dB] 4 program
1 VibrationPeak_median [dB] 3 VibrationCarpet_median [dB] 0 VibrationPeak_mean [dB]
1 VibrationPeak_median [dB] 3 VibrationCarpet_median [dB] 1 VibrationPeak_median [dB]
1 VibrationPeak_median [dB] 3 VibrationCarpet_median [dB] 2 VibrationCarpet_mean [dB]
1 VibrationPeak_median [dB] 3 VibrationCarpet_median [dB] 3 VibrationCarpet_median [dB]
1 VibrationPeak_median [dB] 3 VibrationCarpet_median [dB] 4 program
2 VibrationCarpet_mean [dB] 3 VibrationCarpet_median [dB] 0 VibrationPeak_mean [dB]
2 VibrationCarpet_mean [dB] 3 VibrationCarpet_median [dB] 1 VibrationPeak_median [dB]
2 VibrationCarpet_mean [dB] 3 VibrationCarpet_median [dB] 2 VibrationCarpet_mean [dB]
2 VibrationCarpet_mean [dB] 3 VibrationCarpet_median [dB] 3 VibrationCarpet_median [dB]
2 VibrationCarpet_mean [dB] 3 VibrationCarpet_median [dB] 4 program
for i,col1 in enumerate(sub4.columns):
    for j,col2 in enumerate(sub4.columns):
        for k,col3 in enumerate(sub4.columns):
            if i!=j and i<j and j!=4 and k==4:
                plt.figure(figsize=(10,10))
                sns.scatterplot(x=col1, y=col2, data=sub4, hue="program", hue_order=["A11","A12","A13"], s=150)
                plt.title('{} vs {}'.format(col1,col2))
                plt.show()
Now you can see that some of the points are further away from their groups than it seemed at the beginning (they sit in another program's cluster). How can we see that more clearly? Maybe we should include the program in the scatter plots?
How do you plot a categorical (string) variable? One idea is to map it to integers.
Create a dictionary to map 'A11' to 1, 'A12' to 2 ...
Map values and show the dataframe.
sub5 = sub1.copy()
sub5.loc[sub5['program'] == 'A11','program'] = 1
sub5.loc[sub5['program'] == 'A12','program'] = 3
sub5.loc[sub5['program'] == 'A13','program'] = 2
sub5
| | start | end | VibrationPeak_mean [dB] | VibrationPeak_median [dB] | VibrationCarpet_mean [dB] | VibrationCarpet_median [dB] | Temperature_median [°C] | Temperature_mean [°C] | program |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-09-18 18:58:53 | 2019-09-18 19:18:53 | -6.921236 | -8.304498 | -8.264868 | -5.481411 | -21.700001 | -21.700001 | 1 |
| 1 | 2019-09-18 19:18:53 | 2019-09-18 19:38:53 | -6.470144 | -7.244804 | -2.106002 | -10.807974 | 21.700001 | 21.700001 | 2 |
| 2 | 2019-09-19 09:54:53 | 2019-09-19 10:14:53 | 34.503162 | 35.762915 | 27.908133 | 22.816784 | -22.299999 | -22.299999 | 2 |
| 3 | 2019-09-19 10:14:53 | 2019-09-19 10:34:53 | 22.144701 | 36.509675 | 28.524839 | 28.941813 | 22.200001 | 22.200001 | 2 |
| 4 | 2019-09-19 10:34:53 | 2019-09-19 10:54:53 | -5.823500 | -4.848217 | -1.193132 | 2.436969 | -22.100000 | -22.100000 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83 | 2019-10-29 18:51:30 | 2019-10-29 19:11:30 | -5.548868 | -6.355704 | -10.301523 | -5.439592 | -21.799999 | -21.799999 | 1 |
| 84 | 2019-10-30 13:34:44 | 2019-10-30 13:54:44 | -5.811831 | -4.334031 | -0.468306 | -8.513778 | 21.700001 | 21.700001 | 1 |
| 85 | 2019-10-30 17:09:02 | 2019-10-30 17:22:30 | -1.294405 | -7.711131 | -5.414103 | -3.745550 | -21.400000 | -21.400000 | 1 |
| 86 | 2019-10-30 20:32:01 | 2019-10-30 20:52:01 | -5.099513 | -8.520697 | -13.016096 | -7.445794 | 21.500000 | 21.500000 | 1 |
| 87 | 2019-09-27 17:44:01 | 2019-09-27 18:04:01 | -2.897968 | -6.233262 | -4.108349 | -4.903356 | -23.100000 | -23.100000 | 1 |
88 rows × 9 columns
sub5.head(5)
| | start | end | VibrationPeak_mean [dB] | VibrationPeak_median [dB] | VibrationCarpet_mean [dB] | VibrationCarpet_median [dB] | Temperature_median [°C] | Temperature_mean [°C] | program |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-09-18 18:58:53 | 2019-09-18 19:18:53 | -6.921236 | -8.304498 | -8.264868 | -5.481411 | -21.700001 | -21.700001 | 1 |
| 1 | 2019-09-18 19:18:53 | 2019-09-18 19:38:53 | -6.470144 | -7.244804 | -2.106002 | -10.807974 | 21.700001 | 21.700001 | 2 |
| 2 | 2019-09-19 09:54:53 | 2019-09-19 10:14:53 | 34.503162 | 35.762915 | 27.908133 | 22.816784 | -22.299999 | -22.299999 | 2 |
| 3 | 2019-09-19 10:14:53 | 2019-09-19 10:34:53 | 22.144701 | 36.509675 | 28.524839 | 28.941813 | 22.200001 | 22.200001 | 2 |
| 4 | 2019-09-19 10:34:53 | 2019-09-19 10:54:53 | -5.823500 | -4.848217 | -1.193132 | 2.436969 | -22.100000 | -22.100000 | 1 |
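The task text asks for a dictionary, so here is a sketch of the dictionary-based variant using Series.map. The `demo` frame and the `program_code` column are made up for illustration, and the codes follow the task wording ('A11' to 1, 'A12' to 2, 'A13' to 3). The integers are only labels, so any consistent assignment works for plotting.

```python
import pandas as pd

# Illustrative frame; only the program names match the workshop data.
demo = pd.DataFrame({"program": ["A11", "A13", "A12", "A11"]})

# Dictionary from program name to an integer code.
program_map = {"A11": 1, "A12": 2, "A13": 3}

# map() looks every value up in the dictionary; unknown names would become NaN.
demo["program_code"] = demo["program"].map(program_map)
print(demo)
```

Writing the result to a new column keeps the original names available for plot legends.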
Make the scatter plots again, include 'program'
sub6 = sub5[['VibrationPeak_mean [dB]','VibrationPeak_median [dB]',
'VibrationCarpet_mean [dB]','VibrationCarpet_median [dB]','program']]
for i,col1 in enumerate(sub6.columns):
    for j,col2 in enumerate(sub6.columns):
        for k,col3 in enumerate(sub6.columns):
            if i!=j and i<j and k==4:
                plt.figure(figsize=(10,10))
                sns.scatterplot(x=col1, y=col2, data=sub6, hue="program", palette=['green','blue','red'], s=150)
                plt.title('{} vs {}'.format(col1,col2))
                plt.show()
Now you can clearly see working points that are far away from the usual working points in a particular program.
Finally, let's calculate LOF.
Import LocalOutlierFactor from sklearn. If you don't have the scikit-learn package, install it!
from sklearn.neighbors import LocalOutlierFactor
Create dataframe with columns you would like to include in training.
sub6
| | VibrationPeak_mean [dB] | VibrationPeak_median [dB] | VibrationCarpet_mean [dB] | VibrationCarpet_median [dB] | program |
|---|---|---|---|---|---|
| 0 | -6.921236 | -8.304498 | -8.264868 | -5.481411 | 1 |
| 1 | -6.470144 | -7.244804 | -2.106002 | -10.807974 | 2 |
| 2 | 34.503162 | 35.762915 | 27.908133 | 22.816784 | 2 |
| 3 | 22.144701 | 36.509675 | 28.524839 | 28.941813 | 2 |
| 4 | -5.823500 | -4.848217 | -1.193132 | 2.436969 | 1 |
| ... | ... | ... | ... | ... | ... |
| 83 | -5.548868 | -6.355704 | -10.301523 | -5.439592 | 1 |
| 84 | -5.811831 | -4.334031 | -0.468306 | -8.513778 | 1 |
| 85 | -1.294405 | -7.711131 | -5.414103 | -3.745550 | 1 |
| 86 | -5.099513 | -8.520697 | -13.016096 | -7.445794 | 1 |
| 87 | -2.897968 | -6.233262 | -4.108349 | -4.903356 | 1 |
88 rows × 5 columns
Define the classifier object. Have a look at the documentation, check out the description and examples. Define at least three LOF parameters: n_neighbors, novelty and contamination. Read the docs, have a look at your dataset and think of possible good values. Hint: good values are the ones that give you the result you expect :)
lof = LocalOutlierFactor(n_neighbors=10, contamination='auto', novelty=False)
Train the classifier object (fit to training data)
lof.fit(sub6)
LocalOutlierFactor(n_neighbors=10)
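To get a feel for the contamination parameter, it can help to compare 'auto' with an explicit fraction on data where you know what to expect. A rough sketch on synthetic points (the random data, the seed and the 5% value are assumptions chosen for illustration, not recommended settings for the workshop dataset):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))  # synthetic data, not the workshop dataset

# contamination="auto" uses a fixed threshold of -1.5 on the negative LOF score.
lof_auto = LocalOutlierFactor(n_neighbors=10, contamination="auto")
n_auto = int((lof_auto.fit_predict(X) == -1).sum())

# A float sets the threshold so that roughly this share of points is flagged.
lof_5pct = LocalOutlierFactor(n_neighbors=10, contamination=0.05)
n_5pct = int((lof_5pct.fit_predict(X) == -1).sum())

print("flagged with 'auto':", n_auto)
print("flagged with 0.05:", n_5pct)
```

With novelty=False (the default) the model only scores the data it was fitted on, which is why fit_predict is used here and not a separate predict call.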
Get the negative outlier factor (check the docs again)
lof.negative_outlier_factor_
array([-1.05663666, -1.12918115, -0.96358471, -1.36992817, -1.43083105,
-1.00562152, -1.06542242, -1.6609807 , -2.152835 , -1.12830735,
-1.12208118, -0.93529643, -3.46487282, -1.17208657, -1.03261805,
-1.05794032, -1.10804777, -0.9600572 , -0.9668995 , -1.16045169,
-0.96417508, -1.097853 , -1.04124847, -0.96358053, -1.02410361,
-1.17417803, -1.03971873, -1.23648819, -1.04579194, -1.01311484,
-0.97781779, -1.1398881 , -1.25883338, -0.98298584, -0.97897699,
-1.01705972, -1.46264757, -1.31281301, -1.00288781, -0.9928458 ,
-1.00784891, -0.97427814, -1.0490519 , -1.28509574, -2.02352473,
-0.95705191, -0.98509853, -1.07312896, -1.03774076, -1.02430962,
-1.04183713, -0.98668067, -0.99249329, -0.96142413, -0.97004489,
-1.0390082 , -0.97559897, -1.21981806, -1.28802528, -1.04573126,
-0.98420793, -0.959407 , -1.15309822, -1.2082514 , -1.06992494,
-4.73278711, -1.18517409, -1.0245172 , -0.94253738, -1.21543405,
-1.0200107 , -0.97440239, -6.65200629, -0.99785117, -1.13296133,
-0.99068261, -1.53316828, -1.6010427 , -0.98249186, -1.06264464,
-0.9744652 , -1.84780867, -1.01940412, -1.15547654, -1.15645655,
-1.1209191 , -1.20875806, -0.98146582])
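Scores close to -1 mean a point is about as dense as its neighbours; the more negative the score, the more anomalous the point. In the array above, the values around -6.65, -4.73 and -3.46 stand out. A small sketch of how to pull out the lowest scores, shown on synthetic data so it runs on its own (the cluster, the far-away point and the variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic 2-D cluster plus one far-away point at index 50.
cluster = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X = np.vstack([cluster, [[15.0, 15.0]]])

lof_demo = LocalOutlierFactor(n_neighbors=10, contamination="auto", novelty=False)
labels = lof_demo.fit_predict(X)            # -1 = outlier, 1 = inlier
scores = lof_demo.negative_outlier_factor_  # one score per row of X

# Indices of the three most negative (most anomalous) scores.
most_anomalous = np.argsort(scores)[:3]
print(most_anomalous, scores[most_anomalous])
```

On the workshop data the same `np.argsort` call on `lof.negative_outlier_factor_` would give the row positions worth inspecting first.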
Make the scatter plots again, this time color the marks by the value of prediction
sub6_results = sub6.copy()
neg_values = lof.negative_outlier_factor_
sub6_results['negative_lof'] = neg_values
sub6_results
| | VibrationPeak_mean [dB] | VibrationPeak_median [dB] | VibrationCarpet_mean [dB] | VibrationCarpet_median [dB] | program | negative_lof |
|---|---|---|---|---|---|---|
| 0 | -6.921236 | -8.304498 | -8.264868 | -5.481411 | 1 | -1.056637 |
| 1 | -6.470144 | -7.244804 | -2.106002 | -10.807974 | 2 | -1.129181 |
| 2 | 34.503162 | 35.762915 | 27.908133 | 22.816784 | 2 | -0.963585 |
| 3 | 22.144701 | 36.509675 | 28.524839 | 28.941813 | 2 | -1.369928 |
| 4 | -5.823500 | -4.848217 | -1.193132 | 2.436969 | 1 | -1.430831 |
| ... | ... | ... | ... | ... | ... | ... |
| 83 | -5.548868 | -6.355704 | -10.301523 | -5.439592 | 1 | -1.155477 |
| 84 | -5.811831 | -4.334031 | -0.468306 | -8.513778 | 1 | -1.156457 |
| 85 | -1.294405 | -7.711131 | -5.414103 | -3.745550 | 1 | -1.120919 |
| 86 | -5.099513 | -8.520697 | -13.016096 | -7.445794 | 1 | -1.208758 |
| 87 | -2.897968 | -6.233262 | -4.108349 | -4.903356 | 1 | -0.981466 |
88 rows × 6 columns
for i,col1 in enumerate(sub6_results.columns):
    for j,col2 in enumerate(sub6_results.columns):
        for k,col3 in enumerate(sub6_results.columns):
            if i!=j and i<j and j!=5 and k==4:
                plt.figure(figsize=(10,10))
                sns.scatterplot(x=col1, y=col2, data=sub6_results, hue="negative_lof", palette="rocket",
                                edgecolor="black", s=150)
                plt.title('{} vs {}'.format(col1,col2))
                plt.show()
You can see that points that don't belong to their program cluster were not marked as outliers. Any idea why? Think of the way LOF is calculated. Try to make it better (see better results below).
sub7 = sub6.copy()
lof2 = LocalOutlierFactor(n_neighbors=10, contamination='auto', novelty=False)
lof2.fit(sub7)
lof2.negative_outlier_factor_
sub7_results = sub7.copy()
neg_values2 = lof2.negative_outlier_factor_
sub7_results['negative_lof'] = neg_values2
for i,col1 in enumerate(sub7_results.columns):
    for j,col2 in enumerate(sub7_results.columns):
        for k,col3 in enumerate(sub7_results.columns):
            if i!=j and i<j and j!=5 and k==4:
                plt.figure(figsize=(10,10))
                sns.scatterplot(x=col1, y=col2, data=sub7_results, hue="negative_lof", palette="rocket",
                                style="program", edgecolor="black", s=150)
                plt.title('{} vs {}'.format(col1,col2))
                plt.show()
# For me, assigning a marker shape to each program makes the outliers easier to see (it complements what LOF can't detect).
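One possible answer to the "why" question: the program codes differ by 1 or 2, while the vibration values span tens of dB, so the program column barely changes the distances. A point with A11-like vibration but an A13 label still has dense A11 neighbours, and its LOF stays close to 1. One way to make the program count is to fit a separate LOF model per program on the vibration columns only. This is a sketch on synthetic data under that assumption, not the workshop's reference solution; the column names `peak` and `carpet`, the helper `lof_scores` and the data are made up.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
# Two synthetic programs with well-separated vibration levels (made-up data).
a = rng.normal(loc=-5.0, scale=1.0, size=(30, 2))
b = rng.normal(loc=30.0, scale=1.0, size=(30, 2))
demo = pd.DataFrame(np.vstack([a, b]), columns=["peak", "carpet"])
demo["program"] = ["A11"] * 30 + ["A13"] * 30
# Row 0 has A11-like vibration values but carries the A13 label.
demo.loc[0, "program"] = "A13"

def lof_scores(group):
    """Fit LOF on one program's vibration columns and return its scores."""
    model = LocalOutlierFactor(n_neighbors=10)
    model.fit(group[["peak", "carpet"]])
    return pd.Series(model.negative_outlier_factor_, index=group.index)

# One model per program, so each point is compared only with its own program.
parts = [lof_scores(group) for _, group in demo.groupby("program")]
demo["negative_lof"] = pd.concat(parts).sort_index()

print(demo.sort_values("negative_lof").head(3))
```

Scaling the features or one-hot encoding the program with a large weight are other options you could try; which one works best on the real data is something to check on the plots.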